Abdullah Imran

Will You Survive on Titanic? 🚢¶

The Titanic prediction model aims to predict whether a passenger survived the sinking of the Titanic, based on the features provided in the dataset. This dataset is commonly used for binary classification tasks in machine learning.

Dataset Description¶

The dataset used for this prediction model is available on Kaggle and can be found at the following link: Titanic - Machine Learning from Disaster.

Here is a brief description of the columns in the dataset:

  • Pclass: Passenger class (1 = 1st class, 2 = 2nd class, 3 = 3rd class)
  • Sex: Gender of the passenger (0 = female, 1 = male)
  • Age: Age of the passenger
  • SibSp: Number of siblings or spouses aboard the Titanic
  • Parch: Number of parents or children aboard the Titanic
  • Fare: Ticket fare paid by the passenger
  • Embarked: Port of Embarkation (0 = Cherbourg, 1 = Queenstown, 2 = Southampton)
  • Survived: Survival status (0 = did not survive, 1 = survived)

Titanic

For Prediction:¶

  • I've provided a brief description of the Titanic prediction model and the dataset.
  • The get_user_input function prompts the user to input values for each feature.
  • The predict_survival function takes the input data and uses the trained model to predict the survival status of a passenger.
  • An image of the Titanic is displayed using a markdown image link.
  • A link to the dataset on Kaggle is included for reference.

1. Importing Necessary Libraries 📚¶

In [1]:
# Loading, Preprocessing, Analysis Libraries
import pandas as pd
import numpy as np

# Visualization Libraries
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
%matplotlib inline


# Model Training And Testing libraries
from sklearn.model_selection import train_test_split, cross_val_score, RepeatedStratifiedKFold
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder
from sklearn.metrics import (roc_auc_score, precision_recall_curve, auc, confusion_matrix,
                             recall_score, accuracy_score, precision_score, f1_score,
                             classification_report, roc_curve)
# Model Algorithms
from sklearn.ensemble import AdaBoostClassifier, HistGradientBoostingClassifier, RandomForestClassifier, GradientBoostingClassifier
# Profiling Libraries
from ydata_profiling import ProfileReport

2. Loading Dataset 📊¶

In [2]:
titanic = pd.read_csv(r"D:\Projects\Python\CodeSoft Internship\Titanic_Survival Project\Titanic-Dataset.csv")

2.1 | Profile Report¶

In [10]:
ProfileReport(titanic, title='Titanic Dataset', explorative=True).to_file(r"D:\Projects\Python\CodeSoft Internship\Titanic_Survival Project\Titanic-Dataset_Profile.html")
Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]
Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]
Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]
Export report to file:   0%|          | 0/1 [00:00<?, ?it/s]
In [4]:
titanic.head()
Out[4]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

3. Exploratory Data Analysis 📉¶

3.1 | Info of Dataset¶

In [3]:
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     891 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

3.2 | Shape of Dataset¶

In [14]:
titanic.shape
Out[14]:
(891, 12)
In [29]:
titanic.columns
Out[29]:
Index(['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare',
       'Embarked'],
      dtype='object')

3.3 | Statistical Summary¶

In [10]:
titanic.describe()
Out[10]:
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 891.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 13.002015 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 22.000000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 29.699118 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 35.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
In [7]:
titanic.select_dtypes(include=['object']).describe()
Out[7]:
Name Sex Ticket Cabin Embarked
count 891 891 891 204 891
unique 891 2 681 147 3
top Braund, Mr. Owen Harris male 347082 B96 B98 S
freq 1 577 7 4 645

3.4 | Percent Counts of Each Feature¶

In [13]:
def percent_counts(df, feature):
    total = df[feature].value_counts(dropna=False)
    percent = round(df[feature].value_counts(dropna=False, normalize=True) * 100, 2)
    percent_count = pd.concat([total, percent], keys=['Total', 'Percentage'], axis=1)
    return percent_count
In [14]:
percent_counts(titanic, 'Embarked')
Out[14]:
Total Percentage
S 645 72.39
C 168 18.86
Q 78 8.75
In [16]:
percent_counts(titanic, 'Sex')
Out[16]:
Total Percentage
male 577 64.76
female 314 35.24
In [17]:
percent_counts(titanic, 'Cabin')
Out[17]:
Total Percentage
NaN 687 77.10
C23 C25 C27 4 0.45
G6 4 0.45
B96 B98 4 0.45
C22 C26 3 0.34
... ... ...
E34 1 0.11
C7 1 0.11
C54 1 0.11
E36 1 0.11
C148 1 0.11

148 rows × 2 columns

In [18]:
percent_counts(titanic, 'Survived')
Out[18]:
Total Percentage
0 549 61.62
1 342 38.38
In [19]:
percent_counts(titanic, 'Pclass')
Out[19]:
Total Percentage
3 491 55.11
1 216 24.24
2 184 20.65
In [20]:
percent_counts(titanic, 'SibSp')
Out[20]:
Total Percentage
0 608 68.24
1 209 23.46
2 28 3.14
4 18 2.02
3 16 1.80
8 7 0.79
5 5 0.56

3.5 | Separating Continuous and Categorical Variables¶

In [11]:
continuous_values = []
categorical_values = []

for column in titanic.columns:
    if titanic[column].dtype == 'int64' or titanic[column].dtype == 'float64':
        continuous_values.append(column)
    else:
        categorical_values.append(column)
In [58]:
print("Continuous values: ", continuous_values)
print("Categorical values: ", categorical_values)
Continuous values:  ['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
Categorical values:  ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']
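The loop above works, but `select_dtypes` expresses the same split more directly; a minimal standalone sketch (toy frame, not the Titanic data):

```python
import pandas as pd

# Toy frame mirroring the Titanic dtypes: numeric vs. object columns
df = pd.DataFrame({
    "Age": [22.0, 38.0],       # float64 -> continuous
    "Fare": [7.25, 71.28],     # float64 -> continuous
    "Sex": ["male", "female"]  # object  -> categorical
})

continuous_values = df.select_dtypes(include="number").columns.tolist()
categorical_values = df.select_dtypes(include="object").columns.tolist()
```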

4. Visualization¶

Pclass¶

In [54]:
sns.countplot(x='Pclass', data=titanic)
plt.xlabel('Pclass')
plt.ylabel('Count')
plt.xticks([0, 1, 2], ['First', 'Second', 'Third'])
plt.title('Pclass Count')
plt.show()

Sex¶

In [56]:
sns.countplot(x='Sex', data=titanic, palette=colors)
plt.xlabel('Sex')
plt.ylabel('Count')
plt.xticks([0, 1], ['Female', 'Male'])
plt.title('Sex Count')
plt.show()

SibSp¶

In [110]:
sns.countplot(x='SibSp', data=titanic, palette=colors)
plt.xlabel('SibSp')
plt.ylabel('Count')
plt.xticks([0, 1, 2, 3], ['0', '1', '2', '3'])
plt.title('SibSp Count')
plt.show()

Embarked¶

In [63]:
titanic['Embarked'].value_counts()
Out[63]:
S    646
C    168
Q     77
Name: Embarked, dtype: int64
In [65]:
sns.countplot(x='Embarked', data=titanic, palette=colors)
plt.xlabel('Embarked')
plt.ylabel('Count')
plt.xticks([0, 1, 2], ['Southampton', 'Cherbourg', 'Queenstown'])
plt.title('Embarked Count')
plt.show()

Fares¶

In [67]:
titanic['Fare'].value_counts()
Out[67]:
65.6344    116
8.0500      43
13.0000     42
7.8958      38
7.7500      34
          ... 
6.8583       1
34.6542      1
12.6500      1
12.0000      1
10.5167      1
Name: Fare, Length: 204, dtype: int64
In [74]:
plt.figure(figsize=(10, 6))
sns.histplot(titanic['Fare'], bins=30, kde=False, color= colors[0])
plt.title('Distribution of Fares')
plt.xlabel('Fare')
plt.ylabel('Frequency')
plt.show()

Age¶

In [83]:
sns.histplot(data= titanic, x='Age', bins=30, kde=True)
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

4.1 | Correlation of Numeric Features¶

In [79]:
df_corr= titanic[continuous_values].corr()
df_corr
Out[79]:
PassengerId Survived Pclass Age SibSp Parch Fare
PassengerId 1.000000 -0.005007 -0.035144 0.035533 -0.072778 NaN 0.003243
Survived -0.005007 1.000000 -0.338481 -0.065857 0.031434 NaN 0.317430
Pclass -0.035144 -0.338481 1.000000 -0.330962 0.023180 NaN -0.715300
Age 0.035533 -0.065857 -0.330962 1.000000 -0.251585 NaN 0.137498
SibSp -0.072778 0.031434 0.023180 -0.251585 1.000000 NaN 0.349615
Parch NaN NaN NaN NaN NaN NaN NaN
Fare 0.003243 0.317430 -0.715300 0.137498 0.349615 NaN 1.000000
In [80]:
plt.figure(figsize=(19,7))
sns.heatmap(df_corr, annot = True)
plt.title('Correlation Matrix of Continuous Variables')
plt.show()

4.2 | Correlation With Target¶

In [113]:
# Assuming titanic DataFrame is already defined
df1 = titanic.copy(deep=True)

# Select only numeric columns
numeric_cols = df1.select_dtypes(include=['number']).columns

# Calculate correlations with the 'Survived' column
corr = df1[numeric_cols].drop('Survived', axis=1).corrwith(df1['Survived'], numeric_only=True).sort_values(ascending=False).to_frame()
corr.columns = ['Correlations']

# Plot the heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, linewidths=0.4, linecolor='black', fmt='.2f', cmap='coolwarm')
plt.title('Correlation with Survived')
plt.show()
In [52]:
titanic['Survived'].value_counts()
Out[52]:
0    549
1    342
Name: Survived, dtype: int64

4.3 | Survival Distribution¶

In [21]:
colors = ["#8B0000", "#FFDAB9", "#8B008B"]

# Calculate the percentage of survival and not survival
l = list(titanic['Survived'].value_counts())
circle = [l[1] / sum(l) * 100, l[0] / sum(l) * 100]

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(12, 5))  # Adjust figsize here

# Pie chart
plt.subplot(1, 2, 1)
plt.pie(circle, labels=['Survived', 'Not Survived'], autopct='%1.1f%%', startangle=90,
        explode=(0.1, 0), colors=colors[:2], wedgeprops={'edgecolor': 'black', 'linewidth': 1, 'antialiased': True})
plt.title('Survival %')

# Count plot
plt.subplot(1, 2, 2)
sns.countplot(x='Survived', data=titanic, palette=colors[:2])
plt.xlabel('Survival')
plt.ylabel('Count')
plt.xticks([0, 1], ['Not Survived', 'Survived'])
plt.title('Cases of Survival')

plt.tight_layout()  # Adjust subplot parameters to give specified padding
plt.show()

4.4 | Survival by Gender, Pclass and Embarked¶

In [22]:
# Prepare data
p2 = titanic.groupby(['Survived', 'Sex']).size().unstack(fill_value=0)
p3 = titanic.groupby(['Survived', 'Pclass']).size().unstack(fill_value=0)
p4 = titanic.groupby(['Survived', 'Embarked']).size().unstack(fill_value=0)

# Create figures
fig = make_subplots(rows=1, cols=1, subplot_titles=("Gender Distribution",), horizontal_spacing=0.05, vertical_spacing=0.05)
fig3 = make_subplots(rows=1, cols=1, subplot_titles=("Pclass Distribution",), horizontal_spacing=0.05, vertical_spacing=0.05)
fig4 = make_subplots(rows=1, cols=1, subplot_titles=("Embarked Distribution",), horizontal_spacing=0.05, vertical_spacing=0.05)

# Plot 1 - Gender Distribution
colors2 = ['#646782', '#CDD5DE']
for i, gender in enumerate(p2.columns):
    fig.add_trace(go.Bar(x=p2.index, y=p2[gender], name=gender, marker_color=colors2[i]), row=1, col=1)

# Plot for Pclass Distribution
colors3 = ['#FF9999', '#66CCCC', '#339966']
for i, pclass_type in enumerate(p3.columns):
    fig3.add_trace(go.Bar(x=p3.index, y=p3[pclass_type], name=pclass_type, marker_color=colors3[i]), row=1, col=1)

# Plot for Embarked Distribution
colors4 = ['#FFA07A', '#20B2AA', '#778899', '#8A2BE2']  # Define colors4
for i, embarked_type in enumerate(p4.columns):
    fig4.add_trace(go.Bar(x=p4.index, y=p4[embarked_type], name=embarked_type, marker_color=colors4[i]), row=1, col=1)

# Update layout for Gender Distribution
fig.update_layout(showlegend=True, barmode='group', bargap=0.15, legend_title_text="Gender", height=400, width=800, plot_bgcolor='rgba(255, 255, 255, 0.7)')
fig.update_xaxes(title_text="Survival", row=1, col=1)
fig.update_yaxes(title_text="Frequency", row=1, col=1)

# Update layout for Pclass Distribution
fig3.update_layout(showlegend=True, barmode='group', bargap=0.15, legend_title_text="Pclass Type", height=400, width=800, plot_bgcolor='rgba(255, 255, 255, 0.7)')
fig3.update_xaxes(title_text="Survival", row=1, col=1)
fig3.update_yaxes(title_text="Frequency", row=1, col=1)

# Update layout for Embarked Distribution
fig4.update_layout(showlegend=True, barmode='group', bargap=0.15, legend_title_text="Embarked", height=400, width=800, plot_bgcolor='rgba(255, 255, 255, 0.7)')
fig4.update_xaxes(title_text="Survival", row=1, col=1)
fig4.update_yaxes(title_text="Frequency", row=1, col=1)

# Show plots
fig.show()
fig3.show()
fig4.show()

4.5 | Embarked & Sex¶

In [92]:
# Calculate the value counts for 'Embarked' and 'Sex' and reset the index
Em_sex = titanic[['Embarked', 'Sex']].value_counts().reset_index(name='count')

Em_sex
Out[92]:
Embarked Sex count
0 S male 441
1 S female 205
2 C male 95
3 C female 73
4 Q male 41
5 Q female 36
In [93]:
plt.figure(figsize=(7,6))
sns.barplot(data=Em_sex, x='Embarked', y='count', hue='Sex')
plt.title('Embarked & Sex Frequency')
plt.xlabel('Embarked')
plt.ylabel('Frequency')
plt.show()

4.6 | SibSp & Survived¶

In [106]:
sv_sibling = titanic[['Survived', 'SibSp']].value_counts().reset_index(name='count')
sv_sibling
Out[106]:
Survived SibSp count
0 0 0.0 398
1 1 0.0 210
2 1 1.0 112
3 0 1.0 97
4 0 2.5 39
5 0 2.0 15
6 1 2.0 13
7 1 2.5 7
In [107]:
plt.figure(figsize=(8,6))
sns.barplot(data=sv_sibling, x='Survived', y='count', hue='SibSp')
plt.title('Survived & SibSp Frequency')
plt.legend(loc='upper right')
plt.xlabel('Survived')
plt.ylabel('Frequency')
plt.show()

5. Data Preprocessing (Cleaning) 🧹¶

5.1 | Detecting Outliers¶

In [ ]:
def outlier_detect(df, col):
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    return df[(df[col] < lower_bound) | (df[col] > upper_bound)]

def outlier_detect_normal(df, col):
    mean = df[col].mean()
    std_dev = df[col].std()
    return df[((df[col] - mean) / std_dev).abs() > 3]

def lower_outlier(df, col):
    q1 = df[col].quantile(0.25)
    iqr = df[col].quantile(0.75) - q1
    lower_bound = q1 - 1.5 * iqr
    return df[df[col] < lower_bound]

def upper_outlier(df, col):
    q3 = df[col].quantile(0.75)
    iqr = q3 - df[col].quantile(0.25)
    upper_bound = q3 + 1.5 * iqr
    return df[df[col] > upper_bound]

def replace_upper(df, col):
    q3 = df[col].quantile(0.75)
    iqr = q3 - df[col].quantile(0.25)
    upper_bound = q3 + 1.5 * iqr
    df[col] = df[col].clip(upper=upper_bound)
    print(f'Outliers in column {col} replaced with upper bound ({upper_bound})')

def replace_lower(df, col):
    q1 = df[col].quantile(0.25)
    iqr = df[col].quantile(0.75) - q1
    lower_bound = q1 - 1.5 * iqr
    df[col] = df[col].clip(lower=lower_bound)
    print(f'Outliers in column {col} replaced with lower bound ({lower_bound})')
    
    
Q1 = titanic.quantile(0.25, numeric_only=True)
Q3 = titanic.quantile(0.75, numeric_only=True)
IQR = Q3 - Q1
for i in range(len(continuous_values)):
    print("IQR => {}: {}".format(continuous_values[i], outlier_detect(titanic, continuous_values[i]).shape[0]))
    print("Z_Score => {}: {}".format(continuous_values[i], outlier_detect_normal(titanic, continuous_values[i]).shape[0]))
    print("********************************")

    
outlier = []
for i in range(len(continuous_values)):
    if outlier_detect(titanic[continuous_values],continuous_values[i]).shape[0] !=0:
        outlier.append(continuous_values[i])

outlier

for i in range(len(outlier)):
    replace_upper(titanic, outlier[i]) 
    
print("\n********************************\n")
for i in range(len(outlier)):
    replace_lower(titanic, outlier[i])
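As a quick sanity check of the IQR logic that `replace_upper` applies, here is a standalone toy example (toy series, not the Titanic frame) showing the bound and the clipping:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 100])             # 100 is an obvious outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)  # 2 and 4
iqr = q3 - q1                                # 2
upper_bound = q3 + 1.5 * iqr                 # 7
clipped = s.clip(upper=upper_bound)          # same strategy replace_upper uses
```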

5.2 | Null Value Treatment¶

In [6]:
titanic.isnull().sum()
Out[6]:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
In [7]:
titanic['Age'].fillna(titanic['Age'].mean(), inplace=True)
titanic['Embarked'].fillna("Q", inplace=True)
titanic.isnull().sum()
Out[7]:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
dtype: int64
In [16]:
titanic.drop(labels=['Cabin', 'Name', 'Ticket', 'PassengerId'], axis=1, inplace=True)
titanic.head()
Out[16]:
Survived Pclass Sex Age SibSp Parch Fare Embarked
0 0 3 1 22.0 1 0 7.2500 2
1 1 1 0 38.0 1 0 71.2833 0
2 1 3 0 26.0 0 0 7.9250 2
3 1 1 0 35.0 1 0 53.1000 2
4 0 3 1 35.0 0 0 8.0500 2

5.3 | Duplicates Removal¶

In [68]:
titanic.duplicated().sum()
Out[68]:
0

6. Feature Engineering¶

In [9]:
selected_columns = [ 'Sex', 'Embarked']

6.1 | Label Encoding¶

In [17]:
le = LabelEncoder()
for col in selected_columns:
    titanic[col] = le.fit_transform(titanic[col])

titanic.head()
Out[17]:
Survived Pclass Sex Age SibSp Parch Fare Embarked
0 0 3 1 22.0 1 0 7.2500 2
1 1 1 0 38.0 1 0 71.2833 0
2 1 3 0 26.0 0 0 7.9250 2
3 1 1 0 35.0 1 0 53.1000 2
4 0 3 1 35.0 0 0 8.0500 2
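One caveat with the loop above: it reuses a single `LabelEncoder`, so after the loop only the last column's fit is retained and the `Sex` codes can no longer be decoded. A hedged variant (not what the notebook does) keeps one fitted encoder per column so `inverse_transform` stays available:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"Sex": ["male", "female", "male"],
                   "Embarked": ["S", "C", "Q"]})

encoders = {}
for col in ["Sex", "Embarked"]:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    encoders[col] = le  # keep the fitted encoder for decoding later

# Recover the original labels from the integer codes
decoded = encoders["Sex"].inverse_transform(df["Sex"])
```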

6.2 | MinMax Scaler¶

In [18]:
mms = MinMaxScaler()

df1 = titanic.copy(deep=True)
df1[['Age', 'Fare']] = mms.fit_transform(df1[['Age', 'Fare']])

df1.head()
Out[18]:
Survived Pclass Sex Age SibSp Parch Fare Embarked
0 0 3 1 0.271174 1 0 0.014151 2
1 1 1 0 0.472229 1 0 0.139136 0
2 1 3 0 0.321438 0 0 0.015469 2
3 1 1 0 0.434531 1 0 0.103644 2
4 0 3 1 0.434531 0 0 0.015713 2
In [134]:
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    int64  
 3   Age       714 non-null    float64
 4   SibSp     891 non-null    int64  
 5   Parch     891 non-null    int64  
 6   Fare      891 non-null    float64
 7   Embarked  891 non-null    int64  
dtypes: float64(2), int64(6)
memory usage: 55.8 KB

7. | Train/Test Split¶

In [25]:
features = df1.drop('Survived', axis=1).values
target = df1['Survived'].values
x_train, x_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
print("Training set features shape:", x_train.shape)
print("Testing set features shape:", x_test.shape)
print("Training set target shape:", y_train.shape)
print("Testing set target shape:", y_test.shape)
Training set features shape: (712, 7)
Testing set features shape: (179, 7)
Training set target shape: (712,)
Testing set target shape: (179,)
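Since the target is imbalanced (61.62% vs 38.38%), passing `stratify=target` would preserve that ratio in both splits; a small sketch of the option on toy data (the notebook's split above does not use it):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 60 negatives, 40 positives
target = np.array([0] * 60 + [1] * 40)
features = np.arange(100).reshape(-1, 1)

x_tr, x_te, y_tr, y_te = train_test_split(
    features, target, test_size=0.2, random_state=42, stratify=target
)
# Both splits keep the 60/40 class ratio exactly
```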
In [60]:
# Creating DataFrames for saving to CSV
train_df = pd.DataFrame(x_train, columns=df1.drop('Survived', axis=1).columns)
train_df['Survived'] = y_train

test_df = pd.DataFrame(x_test, columns=df1.drop('Survived', axis=1).columns)
test_df['Survived'] = y_test

# File paths
train_csv_path = 'D:/Projects/Python/CodeSoft Internship/Titanic_Survival Project/train.csv'
test_csv_path = 'D:/Projects/Python/CodeSoft Internship/Titanic_Survival Project/test.csv'

# Saving to CSV
train_df.to_csv(train_csv_path, index=False)
test_df.to_csv(test_csv_path, index=False)

print(f"Training data saved to {train_csv_path}")
print(f"Testing data saved to {test_csv_path}")
Training data saved to D:/Projects/Python/CodeSoft Internship/Titanic_Survival Project/train.csv
Testing data saved to D:/Projects/Python/CodeSoft Internship/Titanic_Survival Project/test.csv

8. Building Machine Learning Model 🤖¶

In [46]:
def model_evaluation(classifier, x_test, y_test):
    # Confusion Matrix
    cm = confusion_matrix(y_test, classifier.predict(x_test))
    names = ['True Neg', 'False Pos', 'False Neg', 'True Pos']

    # Format confusion matrix values (pair each quadrant with its own name)
    labels = [['{}\n{}'.format(names[2 * i + j], cm[i, j]) for j in range(2)] for i in range(2)]

    sns.heatmap(cm, annot=labels, fmt='', annot_kws={"size": 14})
    plt.title('Confusion Matrix')
    plt.show()

    # Classification Report
    print("\nClassification Report:\n", classification_report(y_test, classifier.predict(x_test)))

    # ROC Curve
    plot_roc_curve_custom(classifier, x_test, y_test)

def plot_roc_curve_custom(classifier, x_test, y_test):
    # Calculate ROC curve
    fpr, tpr, _ = roc_curve(y_test, classifier.predict_proba(x_test)[:, 1])

    # Plot ROC curve
    plt.figure(figsize=(8, 8))
    plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve')
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    plt.legend(loc="lower right")
    plt.show()

def model(classifier, x_train, y_train, x_test, y_test):
    classifier.fit(x_train, y_train)
    prediction = classifier.predict(x_test)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

    # Calculate metrics
    accuracy = accuracy_score(y_test, prediction)
    precision = precision_score(y_test, prediction)
    recall = recall_score(y_test, prediction)
    f1 = f1_score(y_test, prediction)
    cross_val_score_mean = cross_val_score(classifier, x_train, y_train, cv=cv, scoring='roc_auc').mean()
    roc_auc = roc_auc_score(y_test, prediction)

    print("Accuracy: {:.2%}".format(accuracy))
    print("Precision: {:.2%}".format(precision))
    print("Recall: {:.2%}".format(recall))
    print("F1 Score: {:.2%}".format(f1))
    print("Cross Validation Score: {:.2%}".format(cross_val_score_mean))
    print("ROC_AUC Score: {:.2%}".format(roc_auc))

    # Evaluation
    model_evaluation(classifier, x_test, y_test)

8.1 | Random Forest Classifier¶

In [56]:
rf = RandomForestClassifier(random_state = 42, n_estimators = 200, max_depth = 4, min_samples_leaf = 2) 
model(rf, x_train, y_train, x_test, y_test)
Accuracy: 82.12%
Precision: 85.00%
Recall: 68.92%
F1 Score: 76.12%
Cross Validation Score: 86.28%
ROC_AUC Score: 80.17%
Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.91      0.86       105
           1       0.85      0.69      0.76        74

    accuracy                           0.82       179
   macro avg       0.83      0.80      0.81       179
weighted avg       0.82      0.82      0.82       179
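A fitted Random Forest also exposes `feature_importances_`, a cheap way to see which columns drive the prediction; a self-contained sketch on synthetic data (toy features, not the Titanic columns):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data where only feature 0 determines the label
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

clf = RandomForestClassifier(random_state=42, n_estimators=50).fit(X, y)
importances = clf.feature_importances_  # sums to 1.0 across features
```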

Predictions of RF¶

In [206]:
# Use the best model for predictions on the test set with selected features
y_pred_rf = rf.predict(x_test)

# Create a new DataFrame with selected features and predictions for RandomForestClassifier
result_rf = pd.DataFrame({ 'Actual': y_test, 'Predicted': y_pred_rf })

# Display the result dataframe
result_rf.head()
Out[206]:
Actual Predicted
709 1 0
439 0 0
840 0 0
720 1 1
39 1 1

8.2 | HistGradientBoostingClassifier¶

In [198]:
hist = HistGradientBoostingClassifier(random_state = 0, max_depth = 4, learning_rate = 0.1)
model(hist, x_train, y_train, x_test, y_test)
Accuracy: 82.68%
Precision: 83.08%
Recall: 72.97%
F1 Score: 77.70%
Cross Validation Score: 85.93%
ROC_AUC Score: 81.25%
Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.90      0.86       105
           1       0.83      0.73      0.78        74

    accuracy                           0.83       179
   macro avg       0.83      0.81      0.82       179
weighted avg       0.83      0.83      0.82       179

Predictions of Hist¶

In [202]:
# Use the best model for predictions on the test set with selected features
y_pred_h = hist.predict(x_test)

# Create a new DataFrame with actual values and predictions for HistGradientBoostingClassifier
result_h = pd.DataFrame({ 'Actual': y_test, 'Predicted': y_pred_h })

# Display the result dataframe
result_h.head()
Out[202]:
Actual Predicted
709 1 0
439 0 0
840 0 0
720 1 1
39 1 1

8.3 | Gradient Boosting Classifier¶

In [195]:
gb = GradientBoostingClassifier(random_state = 0, max_depth = 4)
model(gb, x_train, y_train, x_test, y_test)
Accuracy: 81.01%
Precision: 81.25%
Recall: 70.27%
F1 Score: 75.36%
Cross Validation Score: 85.66%
ROC_AUC Score: 79.42%
Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.89      0.85       105
           1       0.81      0.70      0.75        74

    accuracy                           0.81       179
   macro avg       0.81      0.79      0.80       179
weighted avg       0.81      0.81      0.81       179

Model Comparison Insights 📉¶

HistGradientBoosting Classifier¶

  • Accuracy: 82.68%
  • Precision: 83.08%
  • Recall: 72.97%
  • F1 Score: 77.70%
  • Cross Validation Score: 85.93%
  • ROC_AUC Score: 81.25%

Random Forest Classifier¶

  • Accuracy: 82.12%
  • Precision: 85.00%
  • Recall: 68.92%
  • F1 Score: 76.12%
  • Cross Validation Score: 86.28%
  • ROC_AUC Score: 80.17%

GB Classifier¶

  • Accuracy: 81.01%
  • Precision: 81.25%
  • Recall: 70.27%
  • F1 Score: 75.36%
  • Cross Validation Score: 85.66%
  • ROC_AUC Score: 79.42%

Analysis¶

  • Best Model by Accuracy:

    • The HistGradientBoostingClassifier has the highest accuracy at 82.68%.
  • Best Model by Cross Validation Score:

    • The RandomForestClassifier has the highest cross-validation score at 86.28%.

Conclusion¶

  • If accuracy is the primary metric of interest, the HistGradientBoostingClassifier is the best performing model.
  • If cross-validation score is considered more important, the RandomForestClassifier should be chosen as the best model.

9. | Making Prediction Using RF Model📝¶

In [28]:
def predict_survival(model, input_data):
    # Convert input data to a list
    input_data_as_list = list(input_data)
    # Reshape the list as we are predicting for only one instance
    input_data_reshaped = [input_data_as_list]
    # Make prediction using the model
    prediction = model.predict(input_data_reshaped)
    # Return the prediction
    return prediction[0]

# Function to take input from the user
def get_user_input():
    pclass = int(input("Enter Passenger Class (1, 2, or 3): "))
    sex = int(input("Enter Sex (0 for female, 1 for male): "))
    age = float(input("Enter Age: "))
    sibsp = int(input("Enter Number of Siblings/Spouses Aboard: "))
    parch = int(input("Enter Number of Parents/Children Aboard: "))
    fare = float(input("Enter Fare: "))
    embarked = int(input("Enter Port of Embarkation (0 for Cherbourg, 1 for Queenstown, 2 for Southampton): "))

    # Convert all input data to a list
    input_data_as_list = [pclass, sex, age, sibsp, parch, fare, embarked]

    return input_data_as_list

input_data = get_user_input()
result = predict_survival(rf, input_data)  # pass the fitted Random Forest, not the `model` training helper

# Print results
print("\nIndividual 1:", "Survived" if result == 1 else "Not Survived")
Enter Passenger Class (1, 2, or 3): 1
Enter Sex (0 for female, 1 for male): 1
Enter Age: 0.66
Enter Number of Siblings/Spouses Aboard: 1
Enter Number of Parents/Children Aboard: 0
Enter Fare: 0.81
Enter Port of Embarkation (0 for Cherbourg, 1 for Queenstown, 2 for Southampton): 2

Individual 1: Not Survived
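Because the model was trained on MinMax-scaled `Age` and `Fare`, the user above had to type already-scaled values (0.66, 0.81). A more forgiving variant would scale raw inputs with the fitted scaler first; a sketch assuming the `mms` scaler from section 6.2 (re-fitted here on the dataset's known Age and Fare extremes so the example is runnable on its own):

```python
from sklearn.preprocessing import MinMaxScaler

# Toy re-fit standing in for the `mms` fitted in section 6.2:
# in the training data Age spanned 0.42-80 and Fare spanned 0-512.3292
mms = MinMaxScaler()
mms.fit([[0.42, 0.0], [80.0, 512.3292]])

raw_age, raw_fare = 22.0, 7.25  # human-readable inputs
scaled_age, scaled_fare = mms.transform([[raw_age, raw_fare]])[0]
```

With this step placed in front of `predict_survival`, the user could enter real ages and fares instead of pre-scaled fractions.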